-
Notifications
You must be signed in to change notification settings - Fork 51
feat(trainer): Support namespaced TrainingRuntime in the SDK #130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(trainer): Support namespaced TrainingRuntime in the SDK #130
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
24f00a7 to
1740535
Compare
|
/ok-to-test |
Signed-off-by: Moeed Shaik <[email protected]>
Signed-off-by: Moeed Shaik <[email protected]>
8f0b6d5 to
de2ad1b
Compare
Signed-off-by: Moeed Shaik <[email protected]>
|
Thank you @shaikmoeed for this! |
|
|
||
| def get_runtime(self, name: str) -> types.Runtime: | ||
| """Get the the Runtime object""" | ||
| """Get the the Runtime object prefer namespaced, fall-back to cluster-scoped""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| """Get the the Runtime object prefer namespaced, fall-back to cluster-scoped""" | |
| """Get the Runtime object prefer namespaced, fall-back to cluster-scoped""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same change goes for each occurence here
| ) | ||
|
|
||
| cluster_runtime_list = models.TrainerV1alpha1ClusterTrainingRuntimeList.from_dict( | ||
| cluster_thread.get(constants.DEFAULT_TIMEOUT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| cluster_thread.get(constants.DEFAULT_TIMEOUT) | |
| cluster_thread.get(common_constants.DEFAULT_TIMEOUT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! Reason for test case failures!
| def create_training_runtime( | ||
| name: str, | ||
| namespace: str = "default", | ||
| ) -> models.TrainerV1alpha1TrainingRuntime: | ||
| """Create a mock namespaced TrainingRuntime object (not cluster-scoped).""" | ||
| return models.TrainerV1alpha1TrainingRuntime( | ||
| apiVersion=constants.API_VERSION, | ||
| kind="TrainingRuntime", | ||
| metadata=models.IoK8sApimachineryPkgApisMetaV1ObjectMeta( | ||
| name=name, | ||
| namespace=namespace, | ||
| labels={constants.RUNTIME_FRAMEWORK_LABEL: name}, | ||
| ), | ||
| spec=models.TrainerV1alpha1TrainingRuntimeSpec( | ||
| mlPolicy=models.TrainerV1alpha1MLPolicy( | ||
| torch=models.TrainerV1alpha1TorchMLPolicySource( | ||
| numProcPerNode=models.IoK8sApimachineryPkgUtilIntstrIntOrString(2) | ||
| ), | ||
| numNodes=2, | ||
| ), | ||
| template=models.TrainerV1alpha1JobSetTemplateSpec( | ||
| metadata=models.IoK8sApimachineryPkgApisMetaV1ObjectMeta( | ||
| name=name, | ||
| namespace=namespace, | ||
| ), | ||
| spec=models.JobsetV1alpha2JobSetSpec(replicatedJobs=[get_replicated_job()]), | ||
| ), | ||
| ), | ||
| ) | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you mean to create this in kubernetes/backend_test.py?
this is not a test function and I believe it should be added to the TrainerClient and propagated to the different backends.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to create_train_job, I thought to use create_training_runtime to pass here instead of empty list. Please correct me, if this was not intended?
Signed-off-by: Moeed Shaik <[email protected]>
Signed-off-by: Moeed Shaik <[email protected]>
d1ca707 to
32b18fd
Compare
Signed-off-by: Moeed <[email protected]>
Pull Request Test Coverage Report for Build 19512318478Details
💛 - Coveralls |
Signed-off-by: Moeed Shaik <[email protected]>
|
@abhijeet-dhumal, can you review it again during your free time? |
kramaranya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @shaikmoeed!
I'd like all of us to spend more time on designing this properly, since there are a lot of items to consider
/assign @kubeflow/kubeflow-sdk-team
| except multiprocessing.TimeoutError as e: | ||
| raise TimeoutError(f"Timeout to list {constants.CLUSTER_TRAINING_RUNTIME_KIND}s") from e | ||
| raise TimeoutError( | ||
| "Timeout to list " | ||
| f"{constants.CLUSTER_TRAINING_RUNTIME_KIND}s/{constants.TRAINING_RUNTIME_KIND}s " | ||
| f"in namespace: {self.namespace}" | ||
| ) from e | ||
| except Exception as e: | ||
| raise RuntimeError(f"Failed to list {constants.CLUSTER_TRAINING_RUNTIME_KIND}s") from e | ||
| raise RuntimeError( | ||
| "Failed to list " | ||
| f"{constants.CLUSTER_TRAINING_RUNTIME_KIND}s/{constants.TRAINING_RUNTIME_KIND}s " | ||
| f"in namespace: {self.namespace}" | ||
| ) from e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What If only cluster runtimes exist (no TrainingRuntime CRD)? The entire list_runtimes will fail with RuntimeError, which we don't want, right?
| except multiprocessing.TimeoutError as e: | ||
| raise TimeoutError(f"Timeout to list {constants.CLUSTER_TRAINING_RUNTIME_KIND}s") from e | ||
| raise TimeoutError( | ||
| "Timeout to list " | ||
| f"{constants.CLUSTER_TRAINING_RUNTIME_KIND}s/{constants.TRAINING_RUNTIME_KIND}s " | ||
| f"in namespace: {self.namespace}" | ||
| ) from e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if one type times out and the other succeeds? I think we still should return partial results
| "Timeout to list " | ||
| f"{constants.CLUSTER_TRAINING_RUNTIME_KIND}s/{constants.TRAINING_RUNTIME_KIND}s " | ||
| f"in namespace: {self.namespace}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: cluster scoped runtimes are not really in a namespace
| except Exception as e: | ||
| logger.warning( | ||
| f"Namespaced TrainingRuntime '{self.namespace}/{name}' not found " | ||
| f"({type(e).__name__}: {e}); falling back to cluster-scoped runtime." | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if it was for example TimeoutError? We will still silently fallback to cluster scoped runtimes, right? I would suggest only treating not found / missing CRD as fall back.
|
|
||
| return result | ||
|
|
||
| def get_runtime(self, name: str) -> types.Runtime: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to list_runtimes(), either case will cause a full failure
| # The Kind name for the TrainingRuntime. | ||
| TRAINING_RUNTIME_KIND = "TrainingRuntime" | ||
|
|
||
| # The plural for the ClusterTrainingRuntime. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # The plural for the ClusterTrainingRuntime. | |
| # The plural for the TrainingRuntime. |
| return result | ||
|
|
||
| for runtime in runtime_list.items: | ||
| for runtime in namespace_runtime_list.items + cluster_runtime_list.items: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shaikmoeed
Quick question : What if runtimes with the same name exists in both cluster and namespace scoped ?
IIUC you have implemented namespace scoped priority in get_runtime() where trainingRuntimes get's first priority..
thinking should it be same case for list_runtimes method too ?
And one more thing, In case of list_runtimes it's just appending both even if duplicates comes in..
So for end user how user will be able to know the kind of runtime via list_runtimes's list items ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we introduce kind and namespace(optional) params in Runtime dataclass here :
sdk/kubeflow/trainer/types/types.py
Line 251 in 49a5087
| class Runtime: |
So that
for runtime in runtimes:
print(f"{runtime.name} ({runtime.kind}, ns={runtime.namespace})")
# Output:
# torch-runtime (TrainingRuntime, ns=team-a)
# torch-runtime (ClusterTrainingRuntime, ns=None)
# custom-runtime (TrainingRuntime, ns=team-a)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WDYT? @kramaranya @szaher ?
What this PR does / why we need it:
Add support to list/get namespaced TrainingRuntime.
Which issue(s) this PR fixes:
Fixes #88
Checklist: